ABSTRACT

Incomplete  data  occur  commonly  in  the  application  of  statistics  to  real 
data,  and  there  exists  a  substantial  literature  on  the  problem.  This  article 
provides  an  overview  of  issues  for  the  statistically  knowledgeable  but  not 
necessarily  statistically  sophisticated  reader.  It  is  intended  to  appear  as 
an  entry  in  The  Encyclopedia  of  Statistical  Sciences. 




AMS  (MOS)  Subject  Classifications:  6207,  62D05,  62H12 

Key  Words:  Missing  Data,  EM  Algorithm,  Nonresponse,  Censored  Data,  Truncated 
Data,  Imputation,  Factorizing  Likelihoods 
Work  Unit  Number  3  -  Statistics  and  Probability 


Sponsored by the United States Army under Contract No. DAAG29-80-C-0041.


INCOMPLETE  DATA  -  ENCYCLOPEDIA  ENTRY 
Roderick  J.  A.  Little  and  Donald  B.  Rubin 

1.  Introduction

Incomplete  data  is  an  extremely  general  problem  in  statistics.  Indeed, 
one  might  view  inferential  statistics  in  general  as  a  collection  of  methods 
for  extending  inferences  from  a  sample  to  a  population  where  the  non-sampled 
values  are  regarded  as  missing  data. 

Although  some  statistical  methods  for  complete  data,  such  as  factor 
analysis,  finite  mixture  models,  and  mixed  model  analysis  of  variance  can  be 
usefully  viewed  as  incomplete  data  methods  (Dempster,  Laird  and  Rubin,  1977), 
we  restrict  this  review  to  more  standard  incomplete  data  problems.  For  the 
class  of  problems  reviewed  here,  we  consider  "missing  data"  to  be  synonymous 
with  "incomplete  data."  After  describing  common  examples  with  missing  data  in 
Section  2,  in  Section  3  we  describe  techniques  for  handling  these  problems. 

In Section 4 we discuss the EM algorithm, a ubiquitous algorithm for finding
maximum likelihood (m.l.) estimates from incomplete data. Useful reviews of
the analysis of incomplete data are given in Afifi and Elashoff (1966),
Hartley  and  Hocking  (1971),  Orchard  and  Woodbury  (1972),  Dempster,  Laird  and 
Rubin  (1977),  and  Little  (1982). 

2.  Common Incomplete Data Problems

We  first  consider  problems  where  missing  values  are  confined  to  a  single 
outcome  variable  y,  and  interest  concerns  the  distribution  of  y,  perhaps 
conditional  on  a  set  of  one  or  more  predictor  variables  x,  that  are  recorded 
for  all  units  in  the  sample.  Sometimes  we  have  no  information  about  the 
missing  values  of  y;  at  other  times  we  may  have  partial  information,  for 
example,  that  they  lie  beyond  a  known  censoring  point  c. 

2.1  Mechanisms Leading to Missing Values


Any  analysis  of  incomplete  data  requires  certain  assumptions  about  the 
distribution  of  the  missing  values,  and  in  particular  how  the  distributions  of 
the missing and observed values of a variable are related. The work of Rubin
(1976a)  distinguishes  three  cases.  If  the  process  leading  to  missing  y 
values  (and  in  particular,  the  probability  that  a  particular  value  of  y  is 
missing)  does  not  depend  on  the  values  of  x  or  y,  then  the  missing  data 
are  called  missing  at  random  and  the  observed  data  are  observed  at  random.  If 
the  process  depends  on  observed  values  of  x  and  y  but  not  on  missing 
values  of  y  the  missing  data  are  called  missing  at  random,  but  the  observed 
data  are  not  observed  at  random.  If  the  process  depends  on  missing  values 
of  y  then  the  missing  data  are  not  missing  at  random;  in  this  case, 
particular  care  is  required  in  deriving  inferences.  Rubin  (1976a)  formalizes 
these  notions  by  defining  a  random  variable  m  that  indicates  for  each  unit 
whether  y  is  observed  or  missing,  and  relating  these  conditions  to 
properties  of  the  conditional  distribution  of  m  given  x  and  y. 
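
The three cases can be illustrated with a small simulation. The sketch below is only an illustration and is not part of Rubin's formal development: the linear model for y, the logistic form of the missingness probabilities and the function name `simulate` are all assumptions made for the example. It compares the mean of the recorded y values with the mean of all y values under each mechanism.

```python
import math
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (mean of all y, mean of recorded y) under one missingness mechanism.

    mechanism: "none"     -> missingness depends on neither x nor y
               "observed" -> missingness depends on the fully recorded x
               "missing"  -> missingness depends on y itself
    The data model y = x + noise is a hypothetical choice for illustration.
    """
    rng = random.Random(seed)
    all_y, recorded_y = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)
        if mechanism == "none":
            p_missing = 0.5
        elif mechanism == "observed":
            p_missing = 1.0 / (1.0 + math.exp(-2.0 * x))
        elif mechanism == "missing":
            p_missing = 1.0 / (1.0 + math.exp(-2.0 * y))
        else:
            raise ValueError(mechanism)
        all_y.append(y)
        if rng.random() >= p_missing:  # the unit's y value is recorded
            recorded_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(recorded_y)


for mech in ("none", "observed", "missing"):
    full, rec = simulate(mech)
    print(f"{mech:9s} full-sample mean {full:6.3f}  recorded mean {rec:6.3f}")
```

In the second case the unadjusted mean of the recorded values is also biased, but the bias can be removed by conditioning on x, for example by regressing y on x among the recorded units. In the third case conditioning on x does not remove it.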

2.2  Analysis  of  Variance 

The  first  incomplete  data  problem  to  receive  systematic  attention  in  the 
statistics  literature  is  that  of  missing  data  in  designed  experiments;  in  the 
context  of  agricultural  trials,  this  problem  is  often  called  the  missing  plot 
problem  (Bartlett,  1937;  Anderson,  1946).  Designed  experiments  investigate 
the  dependence  of  an  outcome  variable,  such  as  yield  of  a  crop,  on  a  set  of 
factors,  such  as  variety,  type  of  fertilizer  and  temperature.  Usually  an 
experimental  design  is  chosen  that  allows  efficient  estimation  of  important 
effects  as  well  as  a  simple  analysis.  The  analysis  is  especially  simple  when 
the  design  matrix  is  easily  inverted,  as  with  complete  or  fractional 
replications of factorial designs. The missing data problem arises when, at
the conclusion of the experiment, the values of the outcome variable are
missing for some of the plots, perhaps because no values were possible, as
when particular plots were not amenable to seeding, or because values were
recorded and then lost. Standard analyses of the resultant incomplete data
assume  the  missing  data  are  missing  at  random,  although  in  practical 
situations  the  plausibility  of  this  assumption  needs  to  be  checked.  The 
analysis  aims  to  exploit  the  "near-balance"  of  the  resulting  data  set  to 
simplify  computations.  For  example,  one  tactic  is  to  substitute  estimates  of 
the  missing  outcome  values  and  then  to  carry  out  the  analysis  assuming  the 
data  to  be  complete.  Questions  needing  attention  then  address  the  choice  of 
appropriate  values  to  substitute  and  how  to  modify  subsequent  analyses  to 
allow  for  such  substitutions.  For  discussions  of  this  and  other  approaches, 
see  Healy  and  Westmacott  (1956),  Wilkinson  (1958),  and  Rubin  (1972,  1976b). 

2.3  Censored  or  Truncated  Outcome  Variable 

We have noted that standard analyses for missing plots assume that the
missing data are missing at random, that is, the probability that a value is
missing can depend on the values of the factors but not on the missing outcome
values. This assumption is violated, for example, when the outcome variable
measures  time  to  an  event  (such  as  death  of  an  experimental  animal,  failure  of 
a  light  bulb),  and  the  times  for  some  units  are  not  recorded  because  the 
experiment  was  terminated  before  the  event  had  occurred;  the  resulting  data 
are  censored.  In  such  cases  the  analysis  must  include  the  information  that 
the  units  with  missing  data  are  censored,  since  if  these  units  are  simply 
discarded,  the  resulting  estimates  can  be  badly  biased. 

The  analysis  of  censored  samples  from  the  Poisson,  binomial  and  negative 
binomial  distributions  is  considered  by  Hartley  (1958).  Other  distributions, 
including the normal, log-normal, exponential, gamma, Weibull, extreme value
and logistic are covered most extensively in the life testing literature (for
reviews,  see  Mann,  Schafer  and  Singpurwalla,  1974;  Tsokos  and  Shimi,  1977). 
Non-parametric  estimation  of  a  distribution  subject  to  censoring  is  carried 
out  by  life  table  methods,  formal  properties  of  which  are  discussed  by  Kaplan 
and  Meier  (1958).  Much  of  this  work  can  be  extended  to  handle  covariate 
information  (Glasser,  1969;  Cox,  1972;  Aitkin  and  Clayton,  1980;  Laird  and 
Oliver,  1981).  The  EM  algorithm,  discussed  here  in  Section  4,  is  a  useful 
computational  device  for  such  problems. 

A  variant  of  censored  values  occurs  when  missing  values  are  known  to  lie 
within  an  interval,  as  when  the  data  are  available  in  grouped  form.  The 
analysis of grouped data is discussed by Hartley (1958), Kulldorff (1961) and
Blight  (1970),  among  others.  Another  variant  of  censored  data  occurs  when  the 
number  of  censored  values  is  unknown.  The  resulting  data  are  called 
truncated,  since  they  can  be  regarded  as  a  sample  from  a  truncated 
distribution.  A  considerable  literature  exists  for  this  form  of  data 
(Hartley,  1958;  Dempster,  Laird  and  Rubin,  1977;  Blumenthal,  Dahiya  and  Gross, 
1978). 

2.4  Sample  Survey  Data 

For the data types discussed in Section 2.3, the missing data are not
missing  at  random,  but  the  mechanisms  leading  to  incomplete  data  are  assumed 
known.  For  example,  the  censoring  points  for  censored  observations  are 
known.  A  common  and  somewhat  more  intractable  problem  occurs  when  the  missing 
data  are  not  missing  at  random  and  the  mechanism  leading  to  missing  data  is  at 
best  partially  known.  Incomplete  data  arising  from  nonresponse  in  sample 
surveys  provide  an  illustration  of  this  kind  of  problem.  For  example, 
nonresponse  to  a  question  on  household  income  often  depends  on  the  amount  of 
that  income,  in  an  unknown  way.  Restricting  the  analysis  to  respondents 
clearly leads to bias in such situations; given the large samples often
available  in  survey  work,  this  bias  is  frequently  more  important  than  the  loss 
of  efficiency  of  estimation  arising  from  the  reduction  in  sample  size. 

The  effect  of  survey  nonresponse  is  minimized  by:  (a)  designing  data 
collection  methods  to  minimize  the  level  of  nonresponse,  (b)  interviewing  a 
subsample of nonrespondents, and (c) collecting auxiliary information on
nonrespondents and employing analytical methods that use this information to
reduce nonresponse bias. Models for nonrandomly missing data, as developed by
Nelson  (1976),  Heckman  (1976)  and  Rubin  (1977),  can  also  be  applied  here. 
Estimates  derived  from  these  models,  however,  are  sensitive  to  aspects  of  the 
model that cannot be tested with the available data (Rubin, 1978; Little,
1982; Greenlees, Reece, and Zieschang, 1982). A thorough discussion of survey
nonresponse is given in the work of the National Academy of Sciences Panel
on Incomplete Data (National Academy of Sciences, 1982).

2.5  Multivariate Incomplete Data

The  incomplete  data  structures  discussed  so  far  are  univariate,  in  the 
sense  that  the  missing  values  are  confined  to  a  single  outcome  variable.  We 
now  turn  to  incomplete  data  structures  that  are  essentially  multivariate  in 
nature. 

Many  multivariate  statistical  analyses  including  least  squares 
regression,  factor  analysis  and  discriminant  analysis  are  based  on  an  initial 
reduction  of  the  data  to  the  sample  mean  vector  and  covariance  matrix  of  the 
variables.  The  question  of  how  to  estimate  these  moments  with  missing  values 
in  one  or  more  of  the  variables  is,  therefore,  an  important  one.  Early 
literature  was  concerned  with  small  numbers  of  variables  (two  or  three)  and 
simple patterns of missing data (Anderson, 1957; Afifi and Elashoff, 1966).


Subsequently, more extensive data sets with general patterns of missing data
were addressed (Buck, 1960; Orchard and Woodbury, 1972; Trawinski and
Bargmann, 1972; Rubin, 1974; Beale and Little, 1975; Little, 1976).

The  reduction  to  first  and  second  moments  is  generally  not  appropriate 
when  the  variables  are  categorical.  In  this  case,  the  data  can  be  expressed 
in  the  form  of  a  multiway  contingency  table.  Most  of  the  work  on  incomplete 
contingency  tables  has  concerned  maximum  likelihood  estimation  assuming  a 
Poisson  or  multinomial  distribution  for  the  cell  counts.  Bivariate 
categorical data form a two-way contingency table; if some observations are
available  on  a  single  variable  only,  then  they  can  be  displayed  as  a 
supplemental  margin.  The  analysis  of  data  with  supplemental  margins  is 
discussed  by  Hocking  and  Oxspring  (1974)  and  Chen  and  Fienberg  (1974). 
Extensions  to  log-linear  models  for  higher  way  tables  with  supplemental 
margins  are  discussed  in  Fuchs  (1982). 

Essentially,  all  of  the  literature  on  multivariate  incomplete  data 
assumes  that  the  missing  data  are  missing  at  random,  and  much  of  it  also 
assumes  that  the  observed  data  are  observed  at  random.  Together  these 
assumptions  imply  that  the  process  that  creates  missing  data  does  not  depend 
on  any  values,  missing  or  observed. 

3.  Methods  for  Handling  Incomplete  Data 

3.1  A  Broad  Taxonomy  of  Methods 

Methods  for  handling  incomplete  data  generally  belong  to  one  or  more  of 
the  following  categories: 

(i)  Methods  that  discard  units  with  data  missing  in  some  variables  and 
analyze only the units with complete data (for example, Nie et al.,
1975). 
(ii) Imputation based procedures. The missing values are filled in and
the resultant completed data are analyzed by standard methods.
For valid inferences to result, modifications to the standard
analyses are required to allow for the differing status of the
real and the imputed values. Commonly used procedures for
imputation include hot deck imputation (cf. Ford, 1981), where
recorded units in the sample are substituted; mean imputation,
where means from sets of recorded values are substituted; and
regression imputation, where the missing variables for a unit are
estimated by predicted values from regression on the known
variables for that unit (Buck, 1960). A variant of imputation
methods  produces  multiple  imputations  for  each  missing  value  and 
thereby  allows  simple  adjustments  to  be  made  to  reflect  the 
differing  status  of  real  and  imputed  values  (Rubin,  1978,  1980). 
(iii) Weighting procedures. Randomization inferences from sample survey
data without nonresponse are commonly based on design weights,
which are inversely proportional to the probability of
selection. For example, let $y_i$ be the value of a variable $y$
for unit $i$ in the population. Then the population mean is
often estimated by

$$\sum \pi_i^{-1} y_i \Big/ \sum \pi_i^{-1} , \qquad (1)$$

where the sums are over sampled units, $\pi_i$ is the probability of
selection for unit $i$ and $\pi_i^{-1}$ is the design weight for unit $i$.
Weighting procedures modify the weights to allow for
nonresponse. The estimator (1) is replaced by

$$\sum (\pi_i \hat{p}_i)^{-1} y_i \Big/ \sum (\pi_i \hat{p}_i)^{-1} , \qquad (2)$$

where the sums are now over sampled units which respond, and $\hat{p}_i$
is an estimate of the probability of response for unit $i$,
usually the proportion of responding units in a subclass of the
sample. Weighting is related to mean imputation; for example, if
the design weights are constant in subclasses of the sample, then
imputing the subclass mean for missing units in each subclass, or
weighting responding units by the proportion responding in each
subclass, lead to the same estimates of population means, although
not the same estimates of sampling variance unless adjustments are
made to the data with means imputed. A recent discussion of
weighting with extensions to two way classifications is provided
by Scheuren (1982).

(iv)  Model-based  procedures.  A  broad  class  of  procedures  is  generated 
by  defining  a  model  for  the  incomplete  data  and  basing  inferences 
on  the  likelihood  under  that  model,  with  parameters  estimated  by 
procedures such as maximum likelihood. Advantages of this
approach are: flexibility; the avoidance of adhocery, in that
model assumptions underlying the resulting methods can be
displayed and evaluated; and the availability of large sample
estimates  of  variance  based  on  second  derivatives  of  the  log- 
likelihood,  which  take  into  account  incompleteness  in  the  data. 
Disadvantages  are  that  computational  demands  can  be  large, 
particularly  for  complex  patterns  of  missing  data,  and  that  little 
is  known  about  the  small  sample  properties  of  many  of  the  large 
sample  approximations. 
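
The relation between weighting and mean imputation noted under (iii) can be checked numerically. The following is a minimal sketch under stated assumptions: selection probabilities are taken to be equal for all units (so they cancel in the weighted estimator), the response probability is estimated by the subclass response rate, and the data and function names are hypothetical.

```python
from collections import defaultdict


def weighted_mean(units):
    """Weighted estimator of the mean with equal selection probabilities.

    units: list of (subclass, y) pairs, with y = None for nonrespondents.
    The response probability of a unit is estimated by the proportion of
    responding units in its subclass.
    """
    n_sampled = defaultdict(int)
    n_responding = defaultdict(int)
    for subclass, y in units:
        n_sampled[subclass] += 1
        if y is not None:
            n_responding[subclass] += 1
    numerator = denominator = 0.0
    for subclass, y in units:
        if y is None:
            continue
        p_hat = n_responding[subclass] / n_sampled[subclass]
        numerator += y / p_hat
        denominator += 1.0 / p_hat
    return numerator / denominator


def mean_imputed_mean(units):
    """Impute the subclass respondent mean for each nonrespondent, then average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for subclass, y in units:
        if y is not None:
            totals[subclass] += y
            counts[subclass] += 1
    completed = [
        y if y is not None else totals[subclass] / counts[subclass]
        for subclass, y in units
    ]
    return sum(completed) / len(completed)


# Hypothetical sample: two subclasses with different response rates.
sample = [("a", 1.0), ("a", 3.0), ("a", None), ("a", None),
          ("b", 10.0), ("b", 12.0), ("b", 14.0), ("b", None)]

print(weighted_mean(sample), mean_imputed_mean(sample))
```

Both functions give the same estimate of the mean for these data, whereas the unadjusted mean of the five respondents differs because the subclass with the larger values also has the higher response rate. The sketch says nothing about variance estimation, where the two approaches differ.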

3.2  The  Modelling  Approach  to  Incomplete  Data 

Any procedure that attempts to handle incomplete data must, either
implicitly or explicitly, model the process that creates missing data. We
prefer the explicit approach since assumptions are then clearly stated.

The parametric form of the modelling argument can be expressed as follows
(Rubin, 1976a). Let $y_p$ denote data that are present and $y_m$ data that are
missing. Suppose that $y = (y_p, y_m)$ has a distribution $f(y_p, y_m \mid \theta)$ indexed
by an unknown parameter $\theta$. If the missing data are missing at random, then
the likelihood of $\theta$ given data $y_p$ is proportional to the density of $y_p$,
obtained by integrating $f(y_p, y_m \mid \theta)$ over $y_m$:

$$L(\theta \mid y_p) \propto \int f(y_p, y_m \mid \theta) \, dy_m . \qquad (3)$$

Likelihood inferences are based on $L(\theta \mid y_p)$. Occasionally in the literature,
the missing values $y_m$ are treated as fixed parameters, rather than
integrated out of the distribution $f(y_p, y_m \mid \theta)$, and joint estimates of $\theta$
and $y_m$ are obtained by maximizing $f(y_p, y_m \mid \theta)$ with respect to $\theta$ and $y_m$
(e.g., Press and Scott, 1976, present a procedure which is essentially
equivalent to this). This approach is not recommended since it can produce
badly biased estimates which are not even consistent unless the fraction of
missing data tends to zero as the sample size increases. Also, the model
relating the missing and observed values of $y$ is not fully exploited, and if
the amount of missing data is substantial, the treatment of $y_m$ as a set of
parameters contradicts the general statistical principle of parsimony.

An important generalization of (3) is to include in the model the
distribution of a vector of variables $m$ indicating whether a value is observed
or missing. The full distribution can be specified as

$$f(m, y_p, y_m \mid \theta, \phi) = f(y_p, y_m \mid \theta) \, f(m \mid y_p, y_m, \phi) , \qquad (4)$$

where $\theta$ is the parameter of interest and $\phi$ relates to the mechanism
leading to missing data. This extended formulation is necessary for
nonrandomly missing data such as arise in censoring problems.

To illustrate (3) and (4), suppose the hypothetical complete data
$y = (y_1, \ldots, y_n)$ is a random sample of size $n$ from the exponential distribution
with mean $\theta$. Then

$$f(y_p, y_m \mid \theta) = \theta^{-n} \exp(-t_n/\theta) ,$$

where $t_n = \sum_{i=1}^{n} y_i$ is the total of the $n$ sampled observations. If $r < n$
observations are present and the remaining $n - r$ are missing, then the
likelihood ignoring the response mechanism is proportional to the density

$$f(y_p \mid \theta) = \theta^{-r} \exp(-t_r/\theta) , \qquad (5)$$

regarded as a function of $\theta$, where $t_r$ is the total of the recorded
observations.

Let $m = (m_1, \ldots, m_n)$ where $m_i = 1$ or $0$ as $y_i$ is recorded or
missing, respectively, so that $r = \sum m_i$. We consider two models for the distribution
of $m$ given $y$. First, suppose observations are independently recorded or
missing with probabilities $\phi$ and $1 - \phi$, respectively. Then

$$f(m \mid y, \phi) = \phi^r (1-\phi)^{n-r} ,$$

and

$$f(y_p, m \mid \theta, \phi) = \phi^r (1-\phi)^{n-r} \theta^{-r} \exp(-t_r/\theta) . \qquad (6)$$

The likelihoods based on (5) and (6) differ by a factor $\phi^r (1-\phi)^{n-r}$ which
does not depend on $\theta$, provided that $\theta$ and $\phi$ are distinct, that is, their
joint parameter space factorizes into a $\theta$-space and a $\phi$-space. Hence we
can base inferences on (5), ignoring the response mechanism.

Suppose instead that the sample is censored, in that only values less
than a known censoring point $c$ are observed. Then

$$f(m \mid y, \phi) = \prod_{i=1}^{n} f(m_i \mid y_i) ,$$

where

$$f(m_i \mid y_i) = \begin{cases} 1 & \text{if } m_i = 1 \text{ and } y_i < c, \text{ or } m_i = 0 \text{ and } y_i \ge c ; \\ 0 & \text{otherwise.} \end{cases}$$

The full likelihood is then proportional to

$$f(y_p, m \mid \theta) = \prod_{i: m_i = 1} f(y_i \mid \theta) \, f(m_i \mid y_i < c) \prod_{i: m_i = 0} \mathrm{pr}(y_i \ge c \mid \theta) = \theta^{-r} \exp(-t_r/\theta) \exp[-(n-r)c/\theta] . \qquad (7)$$

In this case the response mechanism is not ignorable, and the likelihoods
based on (5) and (7) differ. In particular, the maximum likelihood estimate
of $\theta$ based on (5) is $t_r/r$, the mean of the recorded observations, which is
less than the correct maximum likelihood estimate of $\theta$ based on (7), namely
$[t_r + (n-r)c]/r$. The latter estimate has the simple interpretation as the total
time at risk for the uncensored and censored observations divided by the
number of failures ($r$).
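
The difference between the two estimates can be seen in a short simulation. Maximizing the logarithm of (7), $-r \log\theta - [t_r + (n-r)c]/\theta$, over $\theta$ gives the estimate $[t_r + (n-r)c]/r$. The sketch below is illustrative only: the true mean, the censoring point, the sample size and the function name are assumptions chosen for the example.

```python
import random


def exponential_estimates(theta=10.0, c=8.0, n=5000, seed=1):
    """Simulate exponential data with mean theta, censored at c.

    Returns (naive, correct): the maximizer of (5), t_r / r, and the
    maximizer of (7), [t_r + (n - r) c] / r.
    """
    rng = random.Random(seed)
    # random.expovariate takes the rate, i.e. 1 / mean.
    y = [rng.expovariate(1.0 / theta) for _ in range(n)]
    recorded = [v for v in y if v < c]  # values at or beyond c are censored
    r = len(recorded)
    t_r = sum(recorded)
    naive = t_r / r
    correct = (t_r + (n - r) * c) / r
    return naive, correct


naive, correct = exponential_estimates()
print(f"ignoring censoring: {naive:.2f}   allowing for censoring: {correct:.2f}")
```

The estimate that ignores censoring can never exceed c, since it averages only values below c, while the estimate based on (7) should fall close to the mean used to generate the data.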

3.3  Special Data Patterns: Factoring the Likelihood

For certain special patterns of multivariate missing data, maximum
likelihood estimation can be simplified by factoring the joint distribution in
a way which simplifies the likelihood. Suppose for example the data have the
monotone or nested pattern in Figure 1, where $y_j$ represents a set of
variables observed for the same set of observations and $y_j$ is more observed
than $y_{j+1}$, $j = 1, \ldots, J-1$. The joint distribution of $y_1, \ldots, y_J$ can be factored
in the form

$$f(y_1, \ldots, y_J \mid \theta) = f_1(y_1 \mid \theta_1) \, f_2(y_2 \mid y_1, \theta_2) \cdots f_J(y_J \mid y_1, \ldots, y_{J-1}, \theta_J) ,$$

where $f_j$ denotes the conditional distribution of $y_j$ given $y_1, \ldots, y_{j-1}$,
indexed by parameters $\theta_j$. If the parameters $\theta_1, \ldots, \theta_J$ are distinct, then
the likelihood of the data factors into distinct complete-data components,
leading to simple maximum likelihood estimators for $\theta$ (Anderson, 1957;
Rubin, 1974). Maximum likelihood estimation with more general patterns of
incomplete data can be accomplished by the EM algorithm.
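
As a hedged illustration of the factorization, consider the simplest monotone pattern: bivariate normal data in which y1 is recorded for all n units and y2 only for the first r. The marginal distribution of y1 is estimated from all n units, the regression of y2 on y1 from the r complete units, and the two pieces are then recombined. The data and the function name below are hypothetical.

```python
def factored_ml_mean(y1, y2):
    """ML estimates of (mu1, mu2) for bivariate normal data, monotone pattern.

    y1: values recorded for all n units.
    y2: values recorded for the first r <= n units only (len(y2) == r).
    """
    n, r = len(y1), len(y2)
    # Factor 1: marginal distribution of y1, estimated from all n units.
    mu1 = sum(y1) / n
    # Factor 2: regression of y2 on y1, estimated from the r complete units.
    y1_bar_r = sum(y1[:r]) / r
    y2_bar_r = sum(y2) / r
    sxy = sum((a - y1_bar_r) * (b - y2_bar_r) for a, b in zip(y1[:r], y2))
    sxx = sum((a - y1_bar_r) ** 2 for a in y1[:r])
    slope = sxy / sxx
    intercept = y2_bar_r - slope * y1_bar_r
    # Recombine the two factors: mu2 = intercept + slope * mu1.
    mu2 = intercept + slope * mu1
    return mu1, mu2


# Hypothetical data: y2 is missing for the last three units.
y1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y2 = [2.1, 3.9, 6.2, 7.8]
print(factored_ml_mean(y1, y2))
```

For these data the estimate of the mean of y2 is well above the mean of the four recorded y2 values, because the units with y2 missing have the largest values of y1 and the two variables are positively related.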

4.  General  Data  Patterns:  The  EM  Algorithm 

The expectation-maximization (EM) algorithm (Dempster, Laird and Rubin,
1977) is an iterative method of maximum likelihood estimation that applies to
any pattern of missing data. Let $\ell(\theta \mid y_p, y_m)$ denote the log-likelihood of
parameters $\theta$ based on the hypothetical complete data $(y_p, y_m)$. Let
$\theta^{(i)}$ denote an estimate of $\theta$ after iteration $i$ of the algorithm. The
$(i+1)$th iteration consists of an E-step and an M-step. The E-step consists
of taking the expectation of $\ell(\theta \mid y_p, y_m)$ over the conditional distribution
of $y_m$ given $y_p$, evaluated at $\theta = \theta^{(i)}$. That is, the averaged
loglikelihood

$$\ell^*(\theta \mid y_p, \theta^{(i)}) = \int \ell(\theta \mid y_p, y_m) \, f(y_m \mid y_p, \theta^{(i)}) \, dy_m$$

is formed.

The M-step consists in finding $\theta^{(i+1)}$, the value of $\theta$ which
maximizes $\ell^*$. This new estimate, $\theta^{(i+1)}$, then replaces $\theta^{(i)}$ at the
next iteration. Each step of EM increases the loglikelihood of $\theta$ given
$y_p$, $\ell(\theta \mid y_p)$. Under quite general conditions, the algorithm converges to a
maximum value of the loglikelihood $\ell(\theta \mid y_p)$. In particular, if a unique
finite maximum likelihood estimate of $\theta$ exists, the algorithm finds it.

An  important  case  occurs  when  the  complete  data  belong  to  a  regular 
exponential  family.  In  this  case,  the  E-step  reduces  to  estimating  the 
sufficient  statistics  corresponding  to  the  natural  parameters  of  the 
distribution.  The  M-step  corresponds  to  maximum  likelihood  estimation  from 
the  hypothetical  complete  data,  with  the  sufficient  statistics  replaced  by  the 
estimated  sufficient  statistics  from  the  E-step. 
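
A minimal sketch of these two steps, for the censored exponential sample of Section 3.2, is given here. The complete-data sufficient statistic is the total of all n observations. By the lack-of-memory property of the exponential distribution, a value censored at c has conditional expectation c + theta, so the E-step replaces each censored value by c plus the current estimate, and the M-step divides the estimated total by n. The data, starting value and function name are hypothetical.

```python
def em_censored_exponential(recorded, n_censored, c, theta0=1.0,
                            tol=1e-10, max_iter=10000):
    """EM iterations for the mean of an exponential sample censored at c."""
    r = len(recorded)
    n = r + n_censored
    t_r = sum(recorded)
    theta = theta0
    for _ in range(max_iter):
        # E-step: expected complete-data total, using E[y | y > c] = c + theta.
        expected_total = t_r + n_censored * (c + theta)
        # M-step: complete-data ML estimate with the total replaced by its expectation.
        new_theta = expected_total / n
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


# Hypothetical data: four recorded failure times, three units censored at c = 5.
recorded = [1.2, 0.7, 3.4, 2.7]
theta_em = em_censored_exponential(recorded, n_censored=3, c=5.0)
theta_closed_form = (sum(recorded) + 3 * 5.0) / len(recorded)
print(theta_em, theta_closed_form)
```

The iterations converge to the closed-form estimate, total time at risk divided by the number of failures. In this example each iteration multiplies the distance from the limit by (n - r)/n, the fraction of units censored, so convergence slows as that fraction grows.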

The EM algorithm was first introduced for particular problems (e.g.,
Hartley, 1958, for counted data and Blight, 1970, for grouped or censored
data). The regular exponential family case was presented by Sundberg (1974).
Orchard  and  Woodbury  (1972)  discussed  the  algorithm  more  generally,  using  the 
term  "missing  information  principle"  to  describe  the  link  with  the  complete- 
data  loglikelihood.  Dempster,  Laird  and  Rubin  (1977)  introduced  the  term  EM, 
developed  convergence  properties  and  provided  a  large  body  of  examples. 

Recent  applications  include  missing  data  in  discriminant  analysis  (Little, 
1978)  and  regression  with  grouped  or  censored  data  (Hasselblad,  Stead,  and 
Galke,  1980). 

The  EM  algorithm  converges  reliably,  but  it  has  slow  convergence 
properties  if  the  amount  of  information  in  the  missing  data  is  relatively 
large. Also, unlike methods such as Newton-Raphson that need to calculate and
invert an information matrix, EM does not provide asymptotic standard errors
for the maximum likelihood estimates as output from the calculations. Its
popularity derives from its link with maximum likelihood for complete data and
its consequent usually simple computational form. The M-step often
corresponds to a standard method of analysis for complete data and thus can be
carried out with existing technology. The E-step often corresponds to
imputing values for the missing data $y_m$, or more generally, for the
sufficient statistics that are functions of $y_m$ and $y_p$, and as such relates
maximum  likelihood  procedures  to  imputation  methods.  For  example,  the  EM 
algorithm  for  multivariate  normal  data  can  be  viewed  as  an  iterative  version 
of  Buck's  (1960)  method  for  imputing  missing  values  (Beale  and  Little,  1975). 
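
The link between the E-step and regression imputation can be sketched for bivariate normal data with y2 missing for some units. This is an illustration under stated assumptions, not the algorithm as published by any of the authors cited: the data, the starting values, the fixed number of iterations and the function name are hypothetical. The E-step imputes regression predictions for the missing y2 values and adds the residual variance to the expected sum of squares; the M-step recomputes the moments from the completed sufficient statistics.

```python
def em_bivariate_normal(y1, y2, n_iter=500):
    """EM for bivariate normal data with y2 missing for units beyond len(y2).

    Returns estimates (mu1, mu2, s11, s12, s22) of means and covariances.
    """
    n, r = len(y1), len(y2)
    # y1 is fully recorded, so its ML estimates do not change across iterations.
    mu1 = sum(y1) / n
    s11 = sum((a - mu1) ** 2 for a in y1) / n
    # Starting values for the remaining parameters, from the complete units.
    mu2 = sum(y2) / r
    s22 = sum((b - mu2) ** 2 for b in y2) / r
    s12 = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics sum(y2), sum(y2^2), sum(y1*y2).
        slope = s12 / s11
        resid_var = s22 - s12 ** 2 / s11
        sum2 = sum(y2)
        sum22 = sum(b * b for b in y2)
        sum12 = sum(a * b for a, b in zip(y1[:r], y2))
        for a in y1[r:]:
            pred = mu2 + slope * (a - mu1)      # regression imputation of y2
            sum2 += pred
            sum22 += pred * pred + resid_var    # adds back the residual variance
            sum12 += a * pred
        # M-step: complete-data ML estimates from the expected statistics.
        mu2 = sum2 / n
        s22 = sum22 / n - mu2 ** 2
        s12 = sum12 / n - mu1 * mu2
    return mu1, mu2, s11, s12, s22


# Hypothetical data: y2 is missing for the last three units.
y1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y2 = [2.1, 3.9, 6.2, 7.8]
mu1, mu2, s11, s12, s22 = em_bivariate_normal(y1, y2)
print(mu1, mu2, s12 / s11)
```

At convergence the ratio s12 / s11 equals the least squares slope of y2 on y1 computed from the complete units. With three of seven y2 values missing, the iterations approach the limit slowly, which is why a generous number of iterations is used in this sketch.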

Although  the  EM  algorithm  is  a  powerful  tool  for  estimation  from 
incomplete data, many problems remain. For example, nonnormal likelihoods
occur  more  commonly  with  incomplete  data  than  with  complete  data,  and  much 
remains  to  be  learned  about  the  appropriateness  of  many  incomplete-data 
methods  when  applied  to  real  data. 


[Figure 1.  Schematic representation of a monotone ... (axes: Variables, Observations)]